ASPEC: Asian Scientific Paper Excerpt Corpus
Abstract
In this paper, we describe the details of ASPEC (Asian Scientific Paper Excerpt Corpus), the first large-scale parallel corpus in the scientific paper domain. ASPEC was constructed as part of the Japanese-Chinese machine translation project conducted between 2006 and 2010, funded by the Special Coordination Funds for Promoting Science and Technology. It consists of a Japanese-English scientific paper abstract corpus of approximately 3 million parallel sentences (ASPEC-JE) and a Chinese-Japanese scientific paper excerpt corpus of approximately 0.68 million parallel sentences (ASPEC-JC). ASPEC is used as the official dataset for the machine translation evaluation workshop WAT (Workshop on Asian Translation).
Similar resources
CKY-based Convolutional Attention for Neural Machine Translation
This paper proposes a new attention mechanism for neural machine translation (NMT) based on convolutional neural networks (CNNs), which is inspired by the CKY algorithm. The proposed attention represents every possible combination of source words (e.g., phrases and structures) through CNNs, which imitates the CKY table in the algorithm. NMT, incorporating the proposed attention, decodes a targe...
Translation Using JAPIO Patent Corpora: JAPIO at WAT2016
Japan Patent Information Organization (JAPIO) participates in scientific paper subtask (ASPEC-EJ/CJ) and patent subtask (JPC-EJ/CJ/KJ) with phrase-based SMT systems which are trained with its own patent corpora. Using larger corpora than those prepared by the workshop organizer, we achieved higher BLEU scores than most participants in EJ and CJ translations of patent subtask, but in crowdsourci...
Toshiba MT System Description for the WAT2014 Workshop
This paper provides a system description of Toshiba Machine Translation System for WAT2014. We participated in two tasks, namely Japanese-English translation and Japanese-Chinese translation. In each task, we submitted two results; one is a result of a rule-based translation system, and the other is a result which is an output of statistical post editing trained with the ASPEC training corpora....
Developing Asian language corpora: standards and practice
This paper first discusses standards for developing Asian language corpora so as to facilitate international data exchange. Following this, we present two corpora of Asian languages developed at Lancaster University – the EMILLE Corpus, which contains 14 South Asian languages, and the Lancaster Corpus of Mandarin Chinese. Finally, we will demonstrate how to explore these corpora using Xara and ...
Narrowing the Readability Gap Between Scientific Papers and the World Wide Web
As of today, publications are treated as self-contained entities, with usually a few tens of references to relevant papers in the field. References have a restricted semantic: they can only point to papers as a whole, rather than to a specific portion of the document (as anchor hyperlinks can do with HTML pages). The restriction is due in part to LATEX–i.e., papers indeed are not hypertexts– al...